[Mellanox]Correct reboot cause when reboot via power cycle#15
Closed
[Mellanox]Correct reboot cause when reboot via power cycle#15
Conversation
… rebooted by software.
keboliu
reviewed
Oct 11, 2019
Collaborator
keboliu
left a comment
There was a problem hiding this comment.
suggest not to ignore it but treat it as "none hardware"
Also check reboot cause file "reset_sw_reset" which indicates the system was rebooted due to software requesting.
Owner
Author
Fixed. |
Owner
Author
|
Community PR opened. |
stephenxs
pushed a commit
that referenced
this pull request
Jun 10, 2020
…sonic-net#4702) Update AS7312-54X,AS7312-54XS,AS7315-27XB config.bcm file to make sure there is no the following error message. configuration: format error in /usr/share/sonic/hwsku/th-as7312-48x25G+6x100G.config.bcm on line 110 (ignored)#15
stephenxs
pushed a commit
that referenced
this pull request
Dec 10, 2020
This update brings in the following commits. 86c1108 Enable arm architecture to build in addition to amd64 (#37) 4acb2c3 fix bugs and enhance Transformer (#35) 49e5a22 ygot related enhancements and fixes (#34) 51224de Fix ietf yang search path for cvl schema builds (#32) 3c6cdb3 CVL Changes #8: 'must' and 'when' expression evaluation (#31) dabf231 CVL Changes #7: 'leafref' evaluation (#28) 6f9535f CVL Changes #6: Customized Xpath Engine integration (#27) 5e2466b DB-Layer fixes/enhancements (#26) 9a27302 CVL Changes #4: Implementation of new CVL APIs (#22) dbf1093 Translib support for authorization, yang versioning and Delete flag (#21) 80f369e CVL Changes #5: YParser enhancement (#23) 904ce18 CVL Changes #3: Multi-db instance support (#20) 9d24a34 CVL Changes #2: YValidator infra changes for evaluating xpath expression (#19) f3fc40f CVL Changes #1: Initial CVL code reorganization and common infra changes (#18) 4922601 Bulk and RPC API support in translib (#16) 1d730df RFC7895 yang module library implementation (#15)
stephenxs
pushed a commit
that referenced
this pull request
Feb 24, 2021
…tool: not found (sonic-net#6615) Starting with BRCM SAI 4.3.1.5 we see the following :ethtool not fount" error in syslog during boot up: ``` Jan 27 07:36:14.712472 str-s6100-acs-1 INFO syncd#/supervisord: syncd sh: 1: Jan 27 07:36:14.712844 str-s6100-acs-1 INFO syncd#/supervisord: syncd ethtool: not found Jan 27 07:36:14.713228 str-s6100-acs-1 INFO syncd#/supervisord: syncd #15 Jan 27 07:36:14.713840 str-s6100-acs-1 INFO syncd#syncd: [0] SAI_API_HOSTIF:_brcm_sai_hostif_speed_set:11894 cmd ethtool -s Ethernet39 speed 40000 rc:32512 Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet39 Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet39 Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- initializePort: Initializing port alias:Ethernet36 pid:1000000000040 Jan 27 07:36:14.726793 str-s6100-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet36 admin:0 oper:0 addr:4c:76:25:f5:48:80 ifindex:75 master:0 Jan 27 07:36:14.727967 str-s6100-acs-1 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet36(ok) to state db Jan 27 07:36:14.729331 str-s6100-acs-1 NOTICE swss#orchagent: :- addHostIntfs: Create host interface for port Ethernet36 Jan 27 07:36:14.752398 str-s6100-acs-1 INFO syncd#/supervisord: syncd sh: 1: ethtool: not found#015 Jan 27 07:36:14.752689 str-s6100-acs-1 INFO syncd#syncd: [0] SAI_API_HOSTIF:_brcm_sai_hostif_speed_set:11894 cmd ethtool -s Ethernet36 speed 40000 rc:32512 Jan 27 07:36:14.756050 str-s6100-acs-1 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet36 Jan 27 07:36:14.757585 str-s6100-acs-1 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet36 ``` It seems that starting with BRCM SAI 4.2.1.5 syncd is using ethtool to set the host interface speed and since this ethtool was not part of the syncd Docker, we observe these "ethtool not found" issue.
stephenxs
pushed a commit
that referenced
this pull request
Mar 19, 2021
…tool: not found (sonic-net#6615) Starting with BRCM SAI 4.3.1.5 we see the following :ethtool not fount" error in syslog during boot up: ``` Jan 27 07:36:14.712472 str-s6100-acs-1 INFO syncd#/supervisord: syncd sh: 1: Jan 27 07:36:14.712844 str-s6100-acs-1 INFO syncd#/supervisord: syncd ethtool: not found Jan 27 07:36:14.713228 str-s6100-acs-1 INFO syncd#/supervisord: syncd #15 Jan 27 07:36:14.713840 str-s6100-acs-1 INFO syncd#syncd: [0] SAI_API_HOSTIF:_brcm_sai_hostif_speed_set:11894 cmd ethtool -s Ethernet39 speed 40000 rc:32512 Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet39 Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet39 Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- initializePort: Initializing port alias:Ethernet36 pid:1000000000040 Jan 27 07:36:14.726793 str-s6100-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet36 admin:0 oper:0 addr:4c:76:25:f5:48:80 ifindex:75 master:0 Jan 27 07:36:14.727967 str-s6100-acs-1 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet36(ok) to state db Jan 27 07:36:14.729331 str-s6100-acs-1 NOTICE swss#orchagent: :- addHostIntfs: Create host interface for port Ethernet36 Jan 27 07:36:14.752398 str-s6100-acs-1 INFO syncd#/supervisord: syncd sh: 1: ethtool: not found#015 Jan 27 07:36:14.752689 str-s6100-acs-1 INFO syncd#syncd: [0] SAI_API_HOSTIF:_brcm_sai_hostif_speed_set:11894 cmd ethtool -s Ethernet36 speed 40000 rc:32512 Jan 27 07:36:14.756050 str-s6100-acs-1 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet36 Jan 27 07:36:14.757585 str-s6100-acs-1 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet36 ``` It seems that starting with BRCM SAI 4.2.1.5 syncd is using ethtool to set the host interface speed and since this ethtool was not part of the syncd Docker, we observe these "ethtool not found" issue.
stephenxs
pushed a commit
that referenced
this pull request
Feb 10, 2022
* [BFN] Updated platform APIs impl Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * Extended BFN platform SFP APIs implementation * Update sfp.py * [BFN] Extended SFP platform plugin implementation Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * [BFN] Extended Fans platform plugin implementation * [BFN] divided classes Fan and FanDrawer into 2 files * Signed-off-by: Vadym Yashchenko <vadymx.yashchenko@intel.com> What I did Add get_model() function Add get_low_critical_threshold() function Change __get(...) function. How I did it Differnece from previous implementation of __get(...) function is return real value or -9999.9 if value is not provided by thrift API * Add get_presence() function and revised __get() function Signed-off-by: Vadym Yashchenko <vadymx.yashchenko@intel.com> * [BFN] Updated PSU platform APIs impl Signed-off-by: Dmytro Lytvynenko <dmytrox.lytvynenko@intel.com> * Added BFN PSU cache (#9) Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * [BFN] Fans and Fantray platform APIs update (#7) * [BFN] Updated SFP platform APIs (#10) Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com> * [BFN] Updated platform API for thermal (#8) * Signed-off-by: Vadym Yashchenko <vadymx.yashchenko@intel.com> * Revert "[BFN] Fans and Fantray platform APIs update (#7)" (#11) This reverts commit c62a733. * Add support health monitor system (#15) Signed-off-by: Petro Bratash <petrox.bratash@intel.com> * Update chassis.py * [BFN] Updated FANs and FAN Tray platform API (#14) * Fix fix_alignment (#17) Signed-off-by: Petro Bratash <petrox.bratash@intel.com> * [BFN] Improvement show environment (#16) * Added PSU temperature skip into platform.json (#18) Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * Do not skip psud on Newport Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * [BFN] fix fan status from Not OK to Ok (#19) * [BFN] Updated SFP platform plugin (#13) Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com> * [DPB] Fix typo for Ethernet0 2x200G[100G,40G] breakout mode (#21) Signed-off-by: Mykola Gerasymenko <mykolax.gerasymenko@intel.com> * [barefoot] Tmp fix vendor_rev (#22) Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com> * Fixed python issues in sonic_platform/fan_drawer.py Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * Updated fan_drawer.py * Fixing trailing white spaces in fan_drawer.py * [BFN] Fix thrift for SFPs API Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com> * In platform.json, replaced 'false' with '0' to workaround ast.literal_eval() issue Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * [Newport] Thermal manager (#23) * Signed-off-by: Vadym Yashchenko <vadymx.yashchenko@intel.com> * Revert "In platform.json, replaced 'false' with '0' to workaround ast.literal_eval() issue" This reverts commit 1e73127. * Removed 'controllable' options from platform.json to fix factory default config generation Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * Update thermal_manager.py * Migrated SFP plugin to sonic_xcvr API (#30) Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> Co-authored-by: KostiantynYarovyiBf <kostiantynx.yarovyi@intel.com> Co-authored-by: Vadym Yashchenko <vadymx.yashchenko@intel.com> Co-authored-by: Dmytro Lytvynenko <dmytrox.lytvynenko@intel.com> Co-authored-by: Volodymyr Boiko <volodymyrx.boiko@intel.com> Co-authored-by: Petro Bratash <petrox.bratash@intel.com> Co-authored-by: Mykola Gerasymenko <mykolax.gerasymenko@intel.com>
stephenxs
pushed a commit
that referenced
this pull request
Feb 10, 2022
[sonic-linkmgrd][master] submodule update Commits added: 0c23756 Jing Zhang 2022-01-19 Linkmgrd subscribing State DB route event (#13) 12b9951 Longxiang Lyu 2021-12-13 Add TLV support to ICMP payload (#11) 3eedda3 Longxiang Lyu 2022-01-06 Add missing intermediate states (#16) 8da4982 Ying Xie 2022-01-04 [linkmgrd] update README, set coding style guidance (#15) a897cf8 Longxiang Lyu 2021-12-13 Improve PR template (#16) 6fec701 Jing Zhang 2021-12-06 Add pull request template for linkmgrd repo (#9) signed-off-by: Jing Zhang zhangjing@microsoft.com
stephenxs
pushed a commit
that referenced
this pull request
Mar 19, 2022
[sonic-linkmgrd][master] submodule update Commits added: 0c23756 Jing Zhang 2022-01-19 Linkmgrd subscribing State DB route event (#13) 12b9951 Longxiang Lyu 2021-12-13 Add TLV support to ICMP payload (#11) 3eedda3 Longxiang Lyu 2022-01-06 Add missing intermediate states (#16) 8da4982 Ying Xie 2022-01-04 [linkmgrd] update README, set coding style guidance (#15) a897cf8 Longxiang Lyu 2021-12-13 Improve PR template (#16) 6fec701 Jing Zhang 2021-12-06 Add pull request template for linkmgrd repo (#9) signed-off-by: Jing Zhang zhangjing@microsoft.com
stephenxs
pushed a commit
that referenced
this pull request
Nov 24, 2023
SAI 9.x requires a SYNCD_SHM_SIZE specified otherwise it will default to 64mb which is insufficient for syncd. E.G. of a few failures seen when insufficient shmem was set ha_init: The file: warmboot_data_0 is of size=762[MB] and is beyond the directory: /dev/shm available storage of size=64[MB]#15 syncd.sh[26074]: Cannot get SYNCD_SHM_SIZE for chip: [869] in /usr/share/sonic/device/x86_64-broadcom_common/syncd_shm.ini. Skip set SYNCD_SHM_SIZE. Syncd hangs here: syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_shr_ha_section_resize:536 start=0x7f6e641b4000, end=0x7f6e645b4000, len=302276608, free=0x7f6e641b4000 Broadcom recommended using 1gb for DNX devices. Since currently we don't use SAI9.x on master and 202305 this change won't fix anything until we upgrade the SAI on those branches.
stephenxs
pushed a commit
that referenced
this pull request
Aug 29, 2024
…-net#19637) Broadcom requires that programmability_ucode_relative_path is set in SAI11. This soc property replaces the legacy custom_feature_ucode_path Without this we get the following error: syncd#supervisord: syncd 0:dbx_file_get_db_location: DB Resource not defined#015 syncd#supervisord: syncd #15#015 syncd#supervisord: syncd 0:dnx_init_pemla_get_ucode_filepath: Error 'Invalid parameter' indicated ; #15#015
stephenxs
pushed a commit
that referenced
this pull request
Nov 11, 2024
…7250E platform (sonic-net#20367) Update sonic-platform submodule for Nokia-IXR7250E: Fixes Nokia-ION/ndk#57 cdfbbe2 [H4-32D]Update platform modules after OC tests (Update README.md #17) f28eff0 [H4-64D]Fix SFP+ port, eeprom, reboot-cause, thermal algorithm, add PSU input voltage check (Fix rules in Makefiles #15) 178e15a Minor watchdog change for better retention of last kick stamp c479392 Remove rogue platform_reboot file 331abe0 Enhance watchdog script to detect fsde device hung signature 4c6b7c1 Fixed update temperature issue 5002fb7 Remove average and maximum c620130 No PSU Master status led in IMM. No need to set it Signed-off-by: mlok <marty.lok@nokia.com>
stephenxs
pushed a commit
that referenced
this pull request
Nov 27, 2024
…ly (sonic-net#20847) #### Why I did it src/sonic-bmp ``` * bfbd47b - (HEAD -> master, origin/master, origin/HEAD) Merge pull request #15 from FengPan-Frank/makefile (13 hours ago) [Feng-msft] * ad31f5b - Create makefile for build image flow (17 hours ago) [Feng Pan] ``` #### How I did it #### How to verify it #### Description for the changelog
stephenxs
pushed a commit
that referenced
this pull request
Mar 9, 2026
…net#25643) * [build] Add build timing report and dependency analysis tools Add three scripts for build performance instrumentation: - scripts/build-timing-report.sh: Parse per-package timing from build logs (HEADER/FOOTER timestamps), generate sorted duration table, phase breakdown, parallelism timeline, and CSV export. - scripts/build-dep-graph.py: Parse rules/*.mk dependency graph, compute critical path, fan-out/fan-in bottleneck analysis, and generate DOT/JSON output for visualization. - scripts/build-resource-monitor.sh: Sample CPU, memory, disk I/O, and Docker container count during builds for resource utilization analysis. Add "make build-report" target to slave.mk that runs the timing report and dependency analysis after a build completes. Example output from a VS build on 24-core/30GB machine: - 210 packages built in 53m wall time (173m CPU) - Max concurrency: 5 (with SONIC_CONFIG_BUILD_JOBS=4) - Critical path: 14 packages deep (libnl -> libswsscommon -> utilities) - Top bottleneck: LIBSWSSCOMMON with 48 downstream dependents Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> * Address Copilot review: fix 17 bugs in build analysis scripts - Use free -m with division instead of free -g to avoid rounding (#1) - Add = and ?= to Makefile dependency regex patterns (#2, #7) - CPU calculation now uses /proc/stat delta (two reads) (#3, #14) - Fix misleading 'critical path estimate' comment (#4) - Fix parallelism timeline comment (60s not 10s) (#5) - Include after-relationship packages in fan stats (#6) - Guard disk I/O division by zero when INTERVAL<=1 (#8) - Remove unused elapsed_line variable (#9) - Remove redundant LIBSWSSCOMMON_DBG check (#10) - Remove active_make_jobs from CSV header comment (#11) - Wire up _RDEPENDS parsing to build reverse deps (#12) - Remove unnecessary 'if v' filter on rdeps JSON (#13) - Remove unused REPORT_FORMAT parameter (#15) - Add cycle detection to critical path algorithm (#16) - Add execute permission check for companion scripts (#17) Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> --------- Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
stephenxs
pushed a commit
that referenced
this pull request
Mar 30, 2026
…dating udevd rules (sonic-net#26343) - Why I did it On SONiC SmartSwitch platforms with DPUs, systemd-udevd crashes with SIGABRT on every reboot when DPU firmware initialization is slow. During the initramfs boot phase, a standalone systemd-udevd daemon is started to handle device discovery. If DPU firmware takes longer than the 60-second udevadm settle timeout (BlueField-3 DPUs can take 120 seconds each in the failure case when they are stuck), the initramfs cannot stop this udevd before switch_root. The stale process survives into the real system but is never chrooted into the overlayfs root, leaving it with a broken filesystem view. When dpu-udev-manager.sh writes udev rules, the stale udevd detects the change and crashes on an assertion in systemd's chase() path resolution (assert(path_is_absolute(p)) at chase.c:648), because dir_fd_is_root() returns false for a process whose root still points to the initramfs rootfs rather than the overlayfs. This triggers a systemd issue : systemd/systemd#29559 which maintainers doesn't consider as a bug from systemd side. Raising this fix for our usecase. Core was generated by `/usr/lib/systemd/systemd-udevd --daemon --resolve-names=never'. Program terminated with signal SIGABRT, Aborted. #0 0x00007f29fe7f695c in ?? () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) bt #0 0x00007f29fe7f695c in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f29fe7a1cc2 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00007f29fe78a4ac in abort () from /lib/x86_64-linux-gnu/libc.so.6 #3 0x00007f29fea50c11 in ?? () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so #4 0x00007f29feb94a8b in chase () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so #5 0x00007f29feb956e2 in chase_and_opendir () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so #6 0x00007f29feb9a609 in conf_files_list_strv () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so #7 0x00007f29fea913e8 in config_get_stats_by_path () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so #8 0x0000559f295519cf in ?? () #9 0x0000559f29553a77 in ?? () #10 0x00007f29fec36055 in ?? () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so #11 0x00007f29fec3668d in sd_event_dispatch () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so #12 0x00007f29fec394a8 in sd_event_run () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so #13 0x00007f29fec396c7 in sd_event_loop () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so #14 0x0000559f29545820 in ?? () #15 0x00007f29fe78bca8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #16 0x00007f29fe78bd65 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 #17 0x0000559f29545c51 in ?? () - How I did it Added a kill_stale_udevd() function to dpu-udev-manager.sh that runs before writing the udev rules. It identifies the systemd-managed udevd PID via systemctl show, then kills any other systemd-udevd --daemon process that doesn't match -- these are leftover initramfs instances. If no stale process exists (e.g. DPUs are healthy and the initramfs udevd exited cleanly), the function is a no-op. - How to verify it Deploy the image on a SmartSwitch with DPUs in a state where firmware initialization times out (>60s per DPU) by stopping image installation before firmware install step Reboot the switch Verify no new systemd-udevd coredumps in /var/core/ Verify the stale process was killed: journalctl -b 0 | grep dpu-udev-manager should show killing stale initramfs udevd PID (systemd udevd is PID ) Verify systemd-udevd.service is healthy: systemctl status systemd-udevd should show active (running) Verify DPU udev rules were written: cat /etc/udev/rules.d/92-midplane-intf.rules should contain the DPU interface naming rules Signed-off-by: Hemanth Kumar Tirupati <tirupatihemanthkumar@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Ignore reset_sw_reset which indicates rebooted by software.
- What I did
Correct reboot cause when reboot via power cycle by ignoring reset_sw_reset.
process-reboot-cause has the following logic:
Currently we implement "reboot" command via using power cycle. After a power-cycle reboot the reset_sw_reset contains a "1". Originally chassis reboot-cause handling treats it as a hardware-caused reboot and returns it to process-reboot-cause. In this case, the process-reboot-cause treats it as the reboot cause which is wrong.
To resolve that, just ignore reset_sw_reset.
- How I did it
Ignore reset_sw_reset
- How to verify it
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)